library(tidyverse)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(forecast)
remotes::install_github("nrennie/LondonMarathon")
data(winners, package = "LondonMarathon")
data(london_marathon, package = "LondonMarathon")

Project outline

Introduction

What is the data about?

The London Marathon, or the TCS London Marathon, is a truly special event that brings people from all walks of life together. It was the brainchild of visionaries Chris Brasher and John Disley back in 1981 and has since become the UK's second-largest annual road race, behind only the famous Great North Run in Newcastle.

Usually happening in April, the marathon had a temporary switch to October in 2020, 2021, and 2022 due to the challenges posed by the COVID-19 pandemic. The course itself is known for its mostly flat terrain, winding its way around the beautiful River Thames, starting at Blackheath and ending with a grand finish at The Mall.

But the London Marathon isn’t just a race; it’s a melting pot of experiences. There’s the mass race, where people of all fitness levels challenge themselves and relish the thrill of completing a marathon. For elite long-distance runners, both men and women, there are professional races showcasing top-tier athleticism. Wheelchair races provide a platform for elite-level competition among both men and women.

Now, let’s dive into the details of the marathon with the comprehensive data:

London Marathon Data:

  • This dataset spills the beans on everything from the number of applicants to accepted participants, starters, finishers, funds raised, and the official charities associated with each year.

Winner Data:

  • The second set of data gives us the lowdown on the winners in different categories over the years, including details like the category, year, the winning athlete’s name, nationality, and their finishing time.

What truly sets the London Marathon apart is its heartwarming philanthropic side. Participants aren’t just running for personal glory; they’re making a difference. Since its start, the marathon has raised over £1 billion for various charities. In 2019 alone, an incredible £66.4 million was raised, marking it as the highest single-day fundraising event in the marathon’s history.

So, as the London Marathon continues to capture hearts and inspire runners worldwide, it's not just a race; it's a legacy of community and achievement.

Questions that we will look at and the methods we used

  1. Analyzing Trends in Winning Times Over the Years: Our examination of winning times involves creating time series plots for each category and using regression analysis to identify significant trends. Through careful comparison and visualization, we aim to uncover patterns and changes in winning times across the various categories.

  2. Comparing Winning Times Between Categories: Using t-tests or analysis of variance (ANOVA), we compare winning times between different categories such as Men vs. Women and Wheelchair Men vs. Wheelchair Women. This analysis enables the exploration of variations in performance and identification of any significant differences between these categories.

  3. Correlation Analysis: To delve deeper into the relationships within the data, we conduct correlation analysis.

  4. In our exploration of the rich history of the London Marathon, we delve into a detailed examination of different aspects. To begin, let’s uncover the evolving trends in marathon participation over the years by closely examining changes in the number of applicants and accepted participants. To give these numbers context and meaning, we calculate the percentage of participants who moved from being accepted to actually starting the race and, subsequently, the percentage of those starters who successfully finished. This analysis aims to reveal any noticeable trends or patterns, breathing life into the raw numbers.

In the realm of inclusivity, we shift our focus to wheelchair winners, seeking to understand how their performances have evolved over the years. Through a detailed analysis of winning times, we aim to visualize trends and patterns, celebrating the achievements of wheelchair participants in the marathon.

Turning our attention to the marathon heroes, we delve into the winner analysis. The spotlight is on the most successful marathon winners, and we aim to understand how their performances have evolved over time by examining variables such as nationality and winning times and identifying any discernible patterns. Visualizing winning times over the years adds a vivid layer to this exploration.

Geographical analysis becomes our compass as we investigate trends or patterns in the nationalities of winners. Employing visualizations through maps, we seek to represent the distribution of winners’ nationalities, painting a global picture of the marathon’s diverse champions.

In the financial realm, we track the yearly fundraising growth, exploring how the total fundraising amount has changed annually. Utilizing a line chart, we aim to capture the financial heartbeat of the London Marathon, identifying any significant spikes or drops and understanding the financial narrative that accompanies this iconic event.

Next, we delve into long-term charity partnerships. By identifying charities that have consistently participated, we analyze their fundraising trends over time. This exploration unveils the enduring commitment of these charitable partners to the London Marathon and the impact of their contributions.

Lastly, we conduct a comparative analysis to understand how the number of male and female participants has changed over the years. Through visualizations, we aim to illustrate the evolving dynamics of gender participation in the London Marathon, showcasing the inclusive spirit of this remarkable event.

What information does the data contain? Cases, variables? How was it collected?

The data provided gives us insights into the London Marathon, offering details on the number of applicants, accepted participants, starters, finishers, funds raised, and the official charities involved for each year. Additionally, we have separate information on the winners across different categories such as Men, Women, Wheelchair Men, and Wheelchair Women throughout the years.

Let’s break down the key aspects:

Cases: Each row in the datasets represents a specific year of the London Marathon. There are 42 entries in the “London Marathon Data” and 165 entries in the “Winner.csv” data.

Variables in London Marathon Data:

  • Date: Reflecting the date of the London Marathon.
  • Year: Indicating the corresponding year of the London Marathon.
  • Applicants: The count of individuals applying to participate in the marathon.
  • Accepted: The number of applicants who secured a spot to participate.
  • Starters: The count of participants who initiated the marathon.
  • Finishers: The number of participants successfully completing the marathon.
  • Raised: The total amount of money raised during the marathon.
  • Official Charity: The designated charity associated with the marathon for that specific year.

Variables in Winner Data:

  • Category: Specifies the category of the winner (Men, Women, Wheelchair Men, Wheelchair Women).
  • Year: Denoting the year when the London Marathon took place.
  • Athlete: The name of the winning athlete.
  • Nationality: The country of origin of the winning athlete.
  • Time: The duration taken by the winning athlete to complete the marathon.
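The Time values above are hh:mm:ss durations; for numeric analysis (t-tests, correlations) they must first be converted to a numeric scale such as decimal hours. A minimal sketch (the helper time_to_hours is our own, not part of the dataset):

```r
# hypothetical helper: convert an "hh:mm:ss" winning time to decimal hours
time_to_hours <- function(x) {
  parts <- as.numeric(strsplit(x, ":")[[1]])
  parts[1] + parts[2] / 60 + parts[3] / 3600
}

time_to_hours("02:11:48")  # about 2.197 hours
```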

How was it collected?

  • The data seems to have been collected over the years, possibly sourced from official records, the official website of the London Marathon, event organizers, media outlets such as BBC, and other reliable sources associated with the event. The information could have been compiled from participant registrations, event statistics, official reports, and media coverage.

Data Wrangling

Which subset of the data is relevant to our research question? Have we filtered out any observations from the analysis? Have we created new variables?

For the winners dataset, all variables and observations are relevant. For the london_marathon dataset, the variables Year, Applicants, Accepted, Starters, and Finishers are most relevant to our research questions, together with Raised and Official charity for the fundraising analysis.

For the winners dataset, no observations are filtered out. For the london_marathon dataset, we filtered out the observations from years 2020-2022.

We created new variables in the london_marathon dataset: the ratio of accepted, the ratio of starters, and the ratio of finishers.
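These wrangling steps can be sketched with dplyr; the two rows below are copied from head(london_marathon) purely so the snippet is self-contained:

```r
library(dplyr)

# two illustrative rows taken from head(london_marathon)
demo <- tibble::tibble(
  Year       = c(1981, 1982),
  Applicants = c(20000, 90000),
  Accepted   = c(7747, 18059),
  Starters   = c(7055, 16350),
  Finishers  = c(6255, 15116)
)

demo_clean <- demo |>
  filter(!Year %in% 2020:2022) |>   # drop the pandemic years (a no-op here)
  mutate(`Ratio of Accepted`  = Accepted  / Applicants,
         `Ratio of Starters`  = Starters  / Accepted,
         `Ratio of Finishers` = Finishers / Starters)

round(demo_clean$`Ratio of Accepted`, 3)  # 0.387 0.201
```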

Are there missing values?

The winners dataset does not contain any missing values, but the london_marathon dataset does. There are many methods for filling missing values, but we chose not to adopt one here, since those values appear to be missing systematically; no imputation method is appropriate unless we can establish why the values are missing.

Do variables have appropriate type?

Here is the data types summary.

library(knitr)
table <- data.frame(
  'Column' = c('Category', 'Year', 'Athlete', 'Nationality', 'Time'),
  'Class' = c('character', 'integer', 'character', 'character', 'character'),
  'Description' = c('Category of race', 'Year', 'Name of the winner',
                    'Nationality of the winner', 'Winning time'),
  'Example' = c('Men', 1981, 'Dick Beardsley(Tie)', 'United States', '02:11:48')
)
kable(table, caption = "Winners")
Winners
Column Class Description Example
Category character Category of race Men
Year integer Year 1981
Athlete character Name of the winner Dick Beardsley(Tie)
Nationality character Nationality of the winner United States
Time character Winning time 02:11:48
table2 <- data.frame(
  'Column' = c('Date', 'Year', 'Applicants', 'Accepted', 'Starters',
               'Finishers', 'Raised', 'Official.charity'),
  'Class' = c('character', 'integer', 'integer', 'integer', 'integer',
              'integer', 'numeric', 'character'),
  'Description' = c('Date of the race', 'Year', 'Number of people who applied',
                    'Number of people accepted', 'Number of people who started',
                    'Number of people who finished',
                    'Amount raised for charity (£ millions)', 'Official charity'),
  'Example' = c('1981-03-29', '1981', '20000', '7747', '7055', '6255', '46.5', 'SportsAid')
)
kable(table2, caption = "London Marathon")
London Marathon
Column Class Description Example
Date character Date of the race 1981-03-29
Year integer Year 1981
Applicants integer Number of people who applied 20000
Accepted integer Number of people accepted 7747
Starters integer Number of people who started 7055
Finishers integer Number of people who finished 6255
Raised numeric Amount raised for charity (£ millions) 46.5
Official.charity character Official charity SportsAid

Are there any strange aspects to the data?

In the london_marathon dataset, the numbers of applicants for years 1981-2012 appear to be recorded in round units of 1000. In addition, the numbers of applicants, accepted, starters, and finishers in 2020 are abnormally low.

Winners(analysis & conclusions)

Analysis of the winners dataset

Overview of the dataset

head(winners)
## # A tibble: 6 Ă— 5
##   Category  Year Athlete              Nationality    Time      
##   <chr>    <dbl> <chr>                <chr>          <times>   
## 1 Men       1981 Dick Beardsley (Tie) United States  0.09152778
## 2 Men       1981 Inge Simonsen (Tie)  Norway         0.09152778
## 3 Men       1982 Hugh Jones           United Kingdom 0.08986111
## 4 Men       1983 Mike Gratton         United Kingdom 0.09008102
## 5 Men       1984 Charlie Spedding     United Kingdom 0.09024306
## 6 Men       1985 Steve Jones          United Kingdom 0.08907407

Compare winning times between different categories (i.e. how much difference is there between the performance of men and women?)

# winners_data: copy of winners with Time converted from fraction of a day to hours
winners_data <- winners |> mutate(Time = as.numeric(Time) * 24)
boxplot(Time ~ Category, data = winners_data, main = "Boxplots of Winning Time of Each Category")

From this plot, we can see a clear difference in performance between men and women.

# t-test for Men vs. Women
t_test_men_women <- t.test(Time ~ Category, data = subset(winners_data, Category %in% c("Men", "Women")))
print(t_test_men_women)
## 
##  Welch Two Sample t-test
## 
## data:  Time by Category
## t = -21.25, df = 69.816, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
##  -0.2865984 -0.2374147
## sample estimates:
##   mean in group Men mean in group Women 
##            2.121004            2.383010

t-value: The t-value of -21.25 is highly significant. It indicates that the difference in mean finishing times between Men and Women is substantial, and it is unlikely to be due to random chance. The negative sign suggests that, on average, Men finish the marathon faster than Women.

Degrees of Freedom: The degrees of freedom are approximately 69.816, calculated using Welch’s correction for unequal variances.

Hypotheses:

Null Hypothesis (H0): The mean winning time for Men is equal to the mean winning time for Women. ÎĽ(Men) = ÎĽ(Women)

Alternative Hypothesis (H1): The mean winning time for Men is different from the mean winning time for Women. μ(Men) ≠ μ(Women)

p-value: The p-value is extremely small (< 2.2e-16), indicating strong evidence against the null hypothesis. A low p-value indicates that the observed difference in mean winning times between Men and Women is highly unlikely to be due to random chance.

Confidence Interval: The 95% confidence interval (-0.287 to -0.237 hours) provides a range where the true difference in mean finishing times between Men and Women is likely to lie. Since this range doesn't include zero, it supports the idea that there is a significant difference.

Sample Estimates: The mean winning time for Men (2.121 hours) is significantly lower than the mean winning time for Women (2.383 hours).

Conclusion: Given the very low p-value, we reject the null hypothesis. There is strong statistical evidence to suggest that there is a significant difference in mean winning times between Men and Women. The negative mean difference in the confidence interval indicates that, on average, Men have a lower winning time compared to Women.

# t-test for Wheelchair Men vs. Wheelchair Women
t_test_wheelchair <- t.test(Time ~ Category, data = subset(winners_data, Category %in% c("Wheelchair Men", "Wheelchair Women")))
print(t_test_wheelchair)
## 
##  Welch Two Sample t-test
## 
## data:  Time by Category
## t = -3.684, df = 66.574, p-value = 0.0004625
## alternative hypothesis: true difference in means between group Wheelchair Men and group Wheelchair Women is not equal to 0
## 95 percent confidence interval:
##  -0.6161296 -0.1830727
## sample estimates:
##   mean in group Wheelchair Men mean in group Wheelchair Women 
##                       1.761944                       2.161546

t-value: The t-value is -3.684, indicating a substantial difference between the mean winning times of Wheelchair Men and Wheelchair Women. The negative sign suggests that, on average, Wheelchair Women have a higher winning time.

Degrees of Freedom (df): The degrees of freedom are 66.574, calculated using Welch’s method.

Null Hypothesis (H0): The mean winning times for Wheelchair Men and Wheelchair Women are equal. Alternative Hypothesis (HA): The mean winning times for Wheelchair Men and Wheelchair Women are not equal.

p-value = 0.0004625: The p-value is less than 0.05, indicating that there is strong evidence to reject the null hypothesis.

95 percent confidence interval: The range -0.616 to -0.183 is the interval within which we can be 95% confident that the true difference in means lies. In this case, it suggests that the mean winning time for Wheelchair Women is likely to be between about 0.18 and 0.62 hours higher than that of Wheelchair Men.

With a small p-value and a negative t-value, there is evidence to suggest that there is a statistically significant difference in winning times between Wheelchair Men and Wheelchair Women. The mean winning time for Wheelchair Women appears to be significantly higher than that for Wheelchair Men. The 95% confidence interval provides a range for the magnitude of this difference.

# ANOVA for Men vs. Women vs. Wheelchair Men vs. Wheelchair Women
anova_model <- aov(Time ~ Category, data = subset(winners_data, Category %in% c("Men", "Women", "Wheelchair Men", "Wheelchair Women")))
summary(anova_model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Category      3  8.036   2.679   24.34 4.81e-13 ***
## Residuals   161 17.714   0.110                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null Hypothesis (H0): There is no significant difference in mean winning times among the categories.

Alternative Hypothesis (H1): There is a significant difference in mean winning times among at least one pair of categories.

We reject the null hypothesis, concluding that there are significant differences in mean winning times between at least one pair of categories (Men, Women, Wheelchair Men, Wheelchair Women).

F-Test: The F value of 24.34 is higher than expected by chance, signifying significant differences in mean winning times among categories.

p-value: With a tiny p-value (4.81e-13), there is substantial evidence against the null hypothesis, implying significant differences in mean winning times.
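Since the ANOVA only tells us that at least one pair of categories differs, a post-hoc procedure such as Tukey's HSD can identify which pairs drive the difference. A minimal sketch using the built-in PlantGrowth data so it runs standalone; the same TukeyHSD call applies to the aov model fitted on winners_data above:

```r
# Tukey's HSD: all pairwise mean comparisons after a significant ANOVA
fit <- aov(weight ~ group, data = PlantGrowth)
posthoc <- TukeyHSD(fit, conf.level = 0.95)
posthoc$group  # one row per pair: difference, CI bounds, adjusted p-value
```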

Explore potential correlations between winning times and other variables

# Explore potential correlations between winning times and other variables.

# Convert 'Nationality' to a factor variable
winners_data$Nationality <- as.factor(winners_data$Nationality)

# Calculate correlation coefficients
cor_year_time <- cor(winners_data$Year, winners_data$Time, method = "pearson")
cor_nat_time <- cor(as.numeric(winners_data$Nationality), winners_data$Time, method = "spearman")

print(paste("Correlation between Year and Time (Pearson):", cor_year_time))
## [1] "Correlation between Year and Time (Pearson): -0.464838452893853"
print(paste("Correlation between Nationality and Time (Spearman):", cor_nat_time))
## [1] "Correlation between Nationality and Time (Spearman): -0.13136760903489"

We observed a correlation between Year and Time (Pearson) of -0.4648, indicating a moderate negative linear relationship. As the Year increases, the Time tends to decrease moderately. The closer the correlation coefficient is to -1, the stronger the negative linear relationship.

Furthermore, the correlation between Nationality and Time (Spearman) is -0.1314, suggesting a weak negative monotonic relationship. Note, however, that Nationality is a nominal variable whose numeric coding is essentially arbitrary (alphabetical), so this coefficient has no substantive interpretation and is reported only for completeness.

Test whether athletes' performance differs based on nationality

Here, the null hypothesis is \(H_0\): there is no difference in athletes' performance based on nationality. The alternative hypothesis is \(H_A\): there is a difference in athletes' performance based on nationality.

Anova test
winners_data$Time <- as.numeric(winners_data$Time)

# Performing ANOVA
anova_model <- aov(Time ~ Nationality, data = winners_data)
anova_summary <- summary(anova_model)

# Print ANOVA summary
print(anova_summary)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Nationality  23  9.792  0.4257   3.761 5.51e-07 ***
## Residuals   141 15.959  0.1132                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Kruskal-Wallis test
# In case the assumptions of ANOVA are not met, use Kruskal-Wallis test as a non-parametric alternative
kruskal_test <- kruskal.test(Time ~ Nationality, data = winners_data)
print(kruskal_test)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Time by Nationality
## Kruskal-Wallis chi-squared = 79.787, df = 23, p-value = 3.437e-08

Here we can see that both tests give a p-value close to 0. So there is evidence to reject the null hypothesis, and we conclude that athletes' performance differs based on nationality.

Summarize which country has the most winners based on category

winners_data |> group_by(Category,Nationality) |> summarise(count = n()) |> arrange(desc(count)) |> slice_head()
## # A tibble: 4 Ă— 3
## # Groups:   Category [4]
##   Category         Nationality    count
##   <chr>            <fct>          <int>
## 1 Men              Kenya             17
## 2 Wheelchair Men   United Kingdom    16
## 3 Wheelchair Women United Kingdom    15
## 4 Women            Kenya             14

Marathon(analysis & conclusions)

Overview of the dataset

head(london_marathon)
## # A tibble: 6 Ă— 8
##   Date        Year Applicants Accepted Starters Finishers Raised
##   <date>     <dbl>      <dbl>    <dbl>    <dbl>     <dbl>  <dbl>
## 1 1981-03-29  1981      20000     7747     7055      6255     NA
## 2 1982-05-09  1982      90000    18059    16350     15116     NA
## 3 1983-04-17  1983      60000    19735    16500     15793     NA
## 4 1984-05-13  1984      70000    21142    16992     15675     NA
## 5 1985-04-21  1985      83000    22274    17500     15873     NA
## 6 1986-04-20  1986      80000    25566    19261     18067     NA
## # ℹ 1 more variable: `Official charity` <chr>

Insights from the data

cat("missing value count:", sum(is.na(london_marathon)))
## missing value count: 43
missing_percentage <- colMeans(is.na(london_marathon)) * 100
plot_data <- data.frame(
  variable = names(missing_percentage),
  percentage = missing_percentage
)
ggplot(plot_data, aes(x = variable, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.5) +
  labs(title = "Percentage of missing values in London Marathon", y = "Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

So we can see that there are some missing values regarding the London Marathon.

Raised is the amount raised for charity (£ millions), and Official charity is the official charity for that year.

We can therefore reason as follows: if Official charity is missing, it likely means there was no official charity that year, so Raised can be treated as 0 (or those rows can be excluded when analyzing Raised). If Official charity is present but Raised is missing, the amount was presumably not made public, so we can fill it in using the average of the other years.

The remaining missing fields all occur in the same rows. According to the official documentation (GitHub), the data for the last two years is missing, so we filter those two years out of london_marathon.

london_marathon1 <- london_marathon %>% group_by(`Official charity`) %>% mutate(Raised=ifelse(is.na(Raised) & !is.na(`Official charity`), mean(london_marathon$Raised, na.rm=T), Raised))

london_marathon1 <- london_marathon1 %>% mutate(Raised=ifelse(is.na(`Official charity`), 0, Raised))

london_marathon1 <- london_marathon1 %>% mutate(`Official charity`=ifelse(is.na(`Official charity`), "None", `Official charity`))

london_marathon1 <- london_marathon1 %>% filter(Year < max(london_marathon1$Year) - 1)

cat("new missing value count:", sum(is.na(london_marathon1)))
## new missing value count: 0
ggplot(london_marathon1, aes(x = Year, y = Raised)) +
  geom_line() +
  labs(title = "Raised Over the Years in London Marathon",
       x = "Year",
       y = "Raised") +
  theme_minimal()

From the line chart, we can see that the filled-in Raised values are not helpful for analyzing how Raised changes over time. To analyze that change, we only need to focus on the data from 2007 to 2019.

filter_data <- london_marathon1 %>% filter(Year >= 2007 & Year <= 2019)
filter_data
## # A tibble: 13 Ă— 8
## # Groups:   Official charity [13]
##    Date        Year Applicants Accepted Starters Finishers Raised
##    <date>     <dbl>      <dbl>    <dbl>    <dbl>     <dbl>  <dbl>
##  1 2007-04-22  2007     128000    50039    36396     35729   46.5
##  2 2008-04-13  2008     120000    48630    35037     34637   46.7
##  3 2009-04-26  2009     155000    49995    35884     35404   47.2
##  4 2010-04-25  2010     163000    51378    36956     36666   50.6
##  5 2011-04-17  2011     163926    50532    35303     34872   51.8
##  6 2012-04-22  2012     170150    50200    37227     36812   52.8
##  7 2013-04-21  2013     167449    48323    34631     34381   53  
##  8 2014-04-13  2014     169682    49872    36337     35977   53.2
##  9 2015-04-26  2015     172888    51696    38020     37793   54.1
## 10 2016-04-24  2016     247069    53152    39523     39140   59.4
## 11 2017-04-23  2017     253930    53229    40048     39487   61.5
## 12 2018-04-22  2018     386050    54685    40926     40220   63.7
## 13 2019-04-28  2019     414168    56398    42906     42549   66.4
## # ℹ 1 more variable: `Official charity` <chr>
ggplot(filter_data, aes(x=Year, y=Raised, group=1))+
  geom_line() + 
  geom_point() + 
  theme()

It can be seen that fundraising increases every year.

It can be seen from the winning-time plots that the winning times of wheelchair men and women have become significantly faster, while the winning times in the men's and women's running races have improved more gradually, indicating that marathon racing has become increasingly competitive.

# london marathon plot
london_plot <- london_marathon %>%
  filter(Year < 2020) %>%
  mutate(Year = factor(Year))

ggplot(
  data = london_plot,
  mapping = aes(y = Year)
) +
  geom_point(aes(x = Starters),
    colour = "#008080"
  ) +
  geom_point(aes(x = Finishers),
    colour = "#800080"
  ) +
  geom_segment(aes(
    x = Starters,
    xend = Finishers,
    y = Year,
    yend = Year
  )) +
  labs(
    x = "Number of runners",
    title = "Number of London Marathon Starters and Finishers"
  ) +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    plot.background = element_rect(fill = "white", colour = "white"),
    panel.background = element_rect(fill = "white", colour = "white")
  )

Analyze the dataset using linear models

In this part, we will use linear models to fit the data and predict the number of applicants, accepted, starters, and finishers in year 2024.

From the previous part, we can see that the number of accepted, starters, and finishers versus year are generally linear. So the models we will use are \[y_i=\beta_0+\beta_1x_i+\epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2)\] For the number of applicants, the model we will use is \[y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2)\]

Moreover, we will check the normality of the residuals.

For the number of applicants

# sub_london: london_marathon with the pandemic years (2020-2022) removed
sub_london <- london_marathon %>% filter(Year < 2020)
fit1 <- lm(Applicants ~ Year + I(Year^2) + I(Year^3), data = sub_london)
ggplot(sub_london, aes(x = Year, y = Applicants)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, color = "blue") +
  theme_minimal() +
  labs(title = "Polynomial Regression Model", x = "Year", y = "Number of Applicants")

Prediction
new <- data.frame('Year'= 2024)
predict(fit1, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
##      fit      lwr      upr
## 1 571826 498633.6 645018.4

The number of applicants in 2024 is estimated to be 571826, with a \(95\%\) confidence interval of \((498634, 645018)\).

Model Checking
plot(fit1, which=1)

plot(fit1, which=2)

For the number of accepted

fit2 <- lm(Accepted ~ Year, data = sub_london)
ggplot(sub_london, aes(x = Year, y = Accepted)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = 'y ~ x',se = FALSE, color = "blue") +
  theme_minimal() +
  labs(title = "SLR Model for Accepted", x = "Year", y = "Number of Accepted")

Prediction

new <- data.frame('Year'= 2024)
predict(fit2, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
##        fit     lwr      upr
## 1 63762.82 60926.6 66599.03

The number of accepted participants in 2024 is estimated to be 63763, with a \(95\%\) confidence interval of \((60927, 66599)\).

Model Checking
plot(fit2, which=1)

plot(fit2, which=2)

For the number of starters

fit3 <- lm(Starters ~ Year, data = sub_london)
ggplot(sub_london, aes(x = Year, y = Starters)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = 'y ~ x',se = FALSE, color = "blue") +
  theme_minimal() +
  labs(title = "SLR Model for Starters", x = "Year", y = "Number of Starters")

Prediction

new <- data.frame('Year'= 2024)
predict(fit3, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
##        fit      lwr      upr
## 1 46119.77 44331.73 47907.82

The number of starters in 2024 is estimated to be 46120, with a \(95\%\) confidence interval of \((44332, 47908)\).

Model Checking
plot(fit3, which=1)

plot(fit3, which=2)

For the number of finishers

fit4 <- lm(Finishers ~ Year, data = sub_london)
ggplot(sub_london, aes(x = Year, y = Finishers)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm",formula = 'y ~ x', se = FALSE, color = "blue") +
  theme_minimal() +
  labs(title = "SLR Model for Finishers", x = "Year", y = "Number of Finishers")

Prediction
new <- data.frame('Year'= 2024)
predict(fit4, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
##        fit      lwr      upr
## 1 46056.64 44306.16 47807.13

The number of finishers in 2024 is estimated to be 46057, with a \(95\%\) confidence interval of \((44306, 47807)\).

Model Checking
plot(fit4, which=1)

plot(fit4, which=2)

Create new variables and visualization

Let \(x\) denote the number of applicants in a year, \(y_1\) the number of accepted in a year, \(y_2\) the number of starters in a year, \(y_3\) the number of finishers in a year. We define the ratio of accepted as \(r_1=\frac{y_1}{x}\), ratio of starters as \(r_2=\frac{y_2}{y_1}\), ratio of finishers as \(r_3=\frac{y_3}{y_2}\).

london_marathon <- london_marathon |> mutate(`Ratio of Accepted`=Accepted/Applicants)|>mutate(`Ratio of Starters`=Starters/Accepted)|> mutate(`Ratio of Finishers`=Finishers/Starters)
print(london_marathon[, c("Year", "Ratio of Accepted", "Ratio of Starters","Ratio of Finishers")])
## # A tibble: 42 Ă— 4
##     Year `Ratio of Accepted` `Ratio of Starters` `Ratio of Finishers`
##    <dbl>               <dbl>               <dbl>                <dbl>
##  1  1981               0.387               0.911                0.887
##  2  1982               0.201               0.905                0.925
##  3  1983               0.329               0.836                0.957
##  4  1984               0.302               0.804                0.922
##  5  1985               0.268               0.786                0.907
##  6  1986               0.320               0.753                0.938
##  7  1987               0.355               0.757                0.912
##  8  1988               0.411               0.749                0.932
##  9  1989               0.441               0.770                0.928
## 10  1990               0.478               0.760                0.944
## # ℹ 32 more rows

Accepted to Started: Over the years, there is a general downward trend in the percentage of accepted participants who actually start the race.

Started to Finished: The percentage of starters who finish the race is relatively stable, generally ranging from the high 80s to the mid-90s.

Notable Observation (2020): In 2020, the "Accepted to Started" percentage is 100%, indicating that all accepted participants started the race, while the "Started to Finished" percentage is lower (79.22%), indicating that not all starters completed the race. This discrepancy is likely due to external factors such as the COVID-19 pandemic, which produced a different pattern in that particular year.

sub_london1 <- london_marathon[1:39, ]
p1 <- ggplot(data = sub_london1, aes(x = Year, y = `Ratio of Accepted`)) +
  geom_point() + geom_line() +
  labs(title = "Ratio of Accepted")
p2 <- ggplot(data = sub_london1, aes(x = Year, y = `Ratio of Starters`)) +
  geom_point() + geom_line() +
  labs(title = "Ratio of Starters")
p3 <- ggplot(data = sub_london1, aes(x = Year, y = `Ratio of Finishers`)) +
  geom_point() + geom_line() +
  labs(title = "Ratio of Finishers")
grid.arrange(p1, p2, p3, ncol = 1)

Time series analysis and forecasting

In this part, we will focus on four time series: the number of applicants each year, the ratio of accepted each year, the ratio of starters each year, and the ratio of finishers each year. We will use ARIMA (AutoRegressive Integrated Moving Average) models to analyze and forecast these time series from the London Marathon dataset. For each series, we will draw a 10-year forecasting plot, check the autocorrelation function (ACF) of the residuals, check the normality of the residuals, and run a Box-Ljung test (a statistical test of whether any of a group of autocorrelations of a time series differ from zero).
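As a sketch of this per-series workflow, the steps could be wrapped in a single helper; `fit_and_check` is a hypothetical name, and the series here is simulated, not the marathon data:

```r
library(forecast)

# Hypothetical helper (not from the report) wrapping the workflow above:
# fit an automatic ARIMA model, forecast h years ahead, and run the
# residual diagnostics.  Demonstrated on a simulated series.
fit_and_check <- function(x, h = 10, lag = 20) {
  fit <- auto.arima(x)
  fc  <- forecast(fit, h = h)
  plot(fc)
  acf(residuals(fit), lag.max = lag, main = "ACF Check for Residuals")
  lb  <- Box.test(residuals(fit), lag = lag, type = "Ljung-Box")
  list(forecast = fc, ljung_box = lb)
}

set.seed(42)
sim <- ts(cumsum(rnorm(40, mean = 6000, sd = 5000)), start = 1981, frequency = 1)
res <- fit_and_check(sim)
res$ljung_box$p.value   # a large p-value suggests uncorrelated residuals
```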

# a function to plot the forecast errors
plotForecastErrors <- function(forecasterrors)
{
  # choose a bin size and a histogram range wide enough to cover
  # both the forecast errors and a reference normal distribution
  mybinsize <- IQR(forecasterrors)/4
  mysd  <- sd(forecasterrors)
  mymin <- min(forecasterrors) - mysd*5
  mymax <- max(forecasterrors) + mysd*3
  # generate normally distributed data with mean 0 and standard deviation mysd
  mynorm <- rnorm(10000, mean=0, sd=mysd)
  mymin2 <- min(mynorm)
  mymax2 <- max(mynorm)
  if (mymin2 < mymin) { mymin <- mymin2 }
  if (mymax2 > mymax) { mymax <- mymax2 }
  # make a red histogram of the forecast errors
  mybins <- seq(mymin, mymax, mybinsize)
  hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
  # freq=FALSE ensures the area under the histogram = 1
  # bin the simulated normal data on the same breaks, without plotting
  myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
  # overlay the normal density as a blue line on the histogram of forecast errors
  points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}

Analysis of number of applicants

Forecasting the number of applicants in next 10 years
app <- london_marathon$Applicants[1:40]
appts <- ts(app, frequency=1, start=1981)
appts <- na.omit(appts)

# fit an automatic ARIMA model
apptsarima <- auto.arima(appts)
# forecast the number of applicants over the next 10 years
fc <- forecast(apptsarima, h=10)
plot(fc, xlab = "Year", ylab = "Number of Applicants", main = "Forecasts of Number of Applicants")

The forecasting plot suggests that the number of applicants for the London Marathon will increase by roughly 6,000 per year over the next 10 years.

Checking ACF of the residuals
# check autocorrelation function with lag from 1 to 20
acf(fc$residuals, lag.max=20, main="ACF Check for Residuals")

Box-Ljung test for residuals
# Box-Ljung test
Box.test(fc$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  fc$residuals
## X-squared = 5.7183, df = 20, p-value = 0.9992
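
The Box-Ljung statistic printed above can be reproduced by hand, which clarifies what the test measures. This sketch uses simulated white noise (`x` and `K` are illustrative names, not from the report):

```r
# Manual Ljung-Box statistic (the quantity Box.test computes),
# shown on simulated white noise as a stand-in for model residuals
set.seed(1)
x <- rnorm(40)
n <- length(x); K <- 20
r <- acf(x, lag.max = K, plot = FALSE)$acf[-1]   # sample autocorrelations r_1..r_K
Q <- n * (n + 2) * sum(r^2 / (n - seq_len(K)))   # Q ~ chi-squared with K df under H0
p <- pchisq(Q, df = K, lower.tail = FALSE)
Box.test(x, lag = K, type = "Ljung-Box")$statistic   # same value as Q
```

A large p-value means the observed autocorrelations are consistent with zero autocorrelation at all K lags.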
Checking normality of the residuals
# histogram of forecasterrors
plotForecastErrors(fc$residuals)

Analysis of ratio of accepted

Forecasting ratio of accepted in next 10 years
roa <- london_marathon$`Ratio of Accepted`
roats <- ts(roa, frequency=1, start=1981)
roats <- na.omit(roats)

# fit an automatic ARIMA model
roatsarima <- auto.arima(roats)
# forecast the ratio of accepted over the next 10 years
fc2 <- forecast(roatsarima, h=10)
plot(fc2, xlab = "Year", ylab = "Ratio of Accepted", main = "Forecasts of Ratio of Accepted")

The forecasting plot shows that the ratio of accepted for the London Marathon will remain roughly constant over the next 10 years.

Checking ACF of the residuals
# check autocorrelation function with lag from 1 to 20
acf(fc2$residuals, lag.max=20, main="ACF Check for Residuals")

Box-Ljung test for residuals
# Box-Ljung test
Box.test(fc2$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  fc2$residuals
## X-squared = 10.473, df = 20, p-value = 0.9588
Checking normality of the residuals
# histogram of forecasterrors
plotForecastErrors(fc2$residuals)

Analysis of ratio of starters

Forecasting ratio of starters in next 10 years
ros <- london_marathon$`Ratio of Starters`
rosts <- ts(ros, frequency=1, start=1981)
rosts <- na.omit(rosts)
rostsarima <- auto.arima(rosts)
fc3 <- forecast(rostsarima, h=10)
plot(fc3, xlab = "Year", ylab = "Ratio of Starters", main = "Forecasts of Ratio of Starters")

The forecasting plot shows that the ratio of starters for the London Marathon will decrease over the next 10 years.

Checking ACF of the residuals
acf(fc3$residuals, lag.max=20, main="ACF Check for Residuals")

Box-Ljung test for residuals
Box.test(fc3$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  fc3$residuals
## X-squared = 8.5628, df = 20, p-value = 0.9874
Checking normality of the residuals
plotForecastErrors(fc3$residuals)

Analysis of ratio of finishers

Forecasting the ratio of finishers in next 10 years
rof <- london_marathon$`Ratio of Finishers`
rofts <- ts(rof, frequency=1, start=1981)
rofts <- na.omit(rofts)
roftsarima <- auto.arima(rofts)
fc4 <- forecast(roftsarima, h=10)
plot(fc4, xlab = "Year", ylab = "Ratio of Finishers", main = "Forecasts of Ratio of Finishers")

The forecasting plot shows that the ratio of finishers for the London Marathon will remain roughly constant over the next 10 years.

Checking ACF of the residuals
acf(fc4$residuals, lag.max=20, main="ACF Check for Residuals")

Box-Ljung test for residuals
Box.test(fc4$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  fc4$residuals
## X-squared = 2.745, df = 20, p-value = 1
Checking normality of the residuals
plotForecastErrors(fc4$residuals)

Since the correlograms for the models above show that none of the sample autocorrelations for lags 1-20 exceed the significance bounds, and the p-values of the Box-Ljung tests are close to 1, there is very little evidence of non-zero autocorrelation in the forecast errors at lags 1-20.

The histograms of the forecast errors show that they are roughly normally distributed with means close to zero. It is therefore plausible that the forecast errors are normally distributed with mean zero and constant variance.

Since successive forecast errors do not appear to be correlated, and the forecast errors appear to be normally distributed with mean zero and constant variance, the ARIMA models seem to provide adequate predictive models for this dataset.
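
As an aside, the forecast package can run the same residual diagnostics (residual plot, ACF, histogram, and Ljung-Box test) in a single call via `checkresiduals()`. A minimal sketch on a simulated series, not the marathon data:

```r
library(forecast)

# checkresiduals() bundles the residual time plot, ACF, histogram,
# and Ljung-Box test into one call; shown here on a simulated series
set.seed(7)
sim <- ts(cumsum(rnorm(40, mean = 6000, sd = 5000)), start = 1981, frequency = 1)
fit <- auto.arima(sim)
checkresiduals(fit, lag = 20)
```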

Clarification

Though there is some overlap between our team members’ work, we came up with our questions independently and did the analysis independently. This report is an integration of our work. Here is what each of us did:

Longxiang Wu

data wrangling part
winner dataset
- plot of time versus year
- boxplots of time by category
- analysis of differences in athletes’ performance by nationality
- analysis of differences between the performance of men and women
- summarising the countries that have the most winners
london_marathon dataset
- plots of variables (applicants, accepted, starters, finishers) versus year
- creating and plotting new variables
- linear models for the dataset
- time series analysis and forecasting
integrated the team’s work and composed report V1.0

Shengle Yan